
[SPARK-13139][SQL] Parse Hive DDL commands ourselves #11573

Closed · wants to merge 17 commits

Conversation

andrewor14 (Contributor)

What changes were proposed in this pull request?

This patch is ported over from @viirya's changes in #11048. Currently, for most DDLs we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and, in the future (not part of this patch), use the `HiveCatalog` to process these DDLs. This is a precursor to merging `SQLContext` and `HiveContext`.

Note: As of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves so in the future we can just use our own catalog.
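For illustration, here is a minimal sketch of what "parsing the commands ourselves" amounts to; the `Token` and command classes below are hypothetical stand-ins, not the actual Catalyst types:

```scala
// Hypothetical sketch: match on Hive's AST tokens ourselves instead of
// handing the raw DDL text to Hive. Simplified stand-in types throughout.
case class Token(name: String, children: List[Token] = Nil)

sealed trait DdlCommand
case class CreateDatabase(name: String, ifNotExists: Boolean) extends DdlCommand
case class PassThroughToHive(sql: String) extends DdlCommand

def parseDdl(node: Token, sql: String): DdlCommand = node match {
  case Token("TOK_CREATEDATABASE", Token(dbName, Nil) :: args) =>
    CreateDatabase(dbName, ifNotExists = args.exists(_.name == "TOK_IFNOTEXISTS"))
  case _ =>
    // Anything we do not recognize still goes to Hive verbatim for now.
    PassThroughToHive(sql)
}
```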

How was this patch tested?

Jenkins, plus a new `DDLCommandSuite`, which accounts for about 40% of the changes here.

    buckets: Option[BucketSpec],
    // TODO: use `clustered` and `sorted` instead for simplicity
    noClustered: Boolean,
    noSorted: Boolean)(sql: String)
Contributor (Author):

@viirya, was there any reason why these have to be negative? They're much easier to understand as positives, i.e. `clustered` and `sorted`.

Member:

Just because the corresponding tokens are `TOK_NOT_CLUSTERED` and `TOK_NOT_SORTED`. Yeah, we can use the positive forms here.
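For example, the positive flags could be derived directly from the negative tokens; this is a hypothetical sketch, not the patch's actual code:

```scala
// Hypothetical sketch: derive positive `clustered`/`sorted` flags from
// the negative Hive tokens, per the renaming suggested above.
case class Token(name: String, children: List[Token] = Nil)

def bucketFlags(children: List[Token]): (Boolean, Boolean) = {
  val clustered = !children.exists(_.name == "TOK_NOT_CLUSTERED")
  val sorted = !children.exists(_.name == "TOK_NOT_SORTED")
  (clustered, sorted)
}
```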

@andrewor14 (Contributor, Author)

Note: The only changes I made on top of #11048 are addressing the outstanding comments in that patch and some minor cleanups. It's entirely possible that there are still things missing or incorrect, given that the original patch was not fully reviewed yet.

@hvanhovell @yhuai PTAL.

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52630 has finished for PR 11573 at commit a663b5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -64,6 +85,53 @@ private[sql] class SparkQl(conf: ParserConf = SimpleParserConf()) extends Cataly
      val tableIdent = extractTableIdent(nameParts)
      RefreshTable(tableIdent)

    case Token("TOK_CREATEDATABASE", Token(databaseName, Nil) :: createDatabaseArgs) =>
      val Seq(
        allowExisting,
Contributor:

Isn't `allowExisting` the exact opposite of `IF NOT EXISTS`? I'm asking because I have a similar case in my PR.

Member:

I don't quite understand your question. Could you clarify?

Contributor:

When I see a parameter named `allowExisting`, I assume that the table is allowed to already exist and, as a consequence, that the CREATE TABLE command is going to overwrite it. Why not name this `ifNotExists`?

Contributor (Author):

I've renamed it.
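A small sketch of the naming point, with hypothetical signatures:

```scala
// Hypothetical names, for illustration: the flag mirrors the SQL clause,
// so its meaning is unambiguous at the call site.
case class CreateDatabase(
    databaseName: String,
    ifNotExists: Boolean) // CREATE DATABASE IF NOT EXISTS ...

// ifNotExists = true  => silently skip creation if the database exists
// ifNotExists = false => fail if the database already exists
// Neither case overwrites an existing database.
```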

This commit cleans up the case class signatures representing
ALTER TABLE commands along with the code that parses them.
This commit also adds a lot of missing documentation that
was sorely needed for AlterTableCommandParser to be readable.

The cleanup in this commit concerns only ALTER TABLE commands,
but is expected to spread to other parts of the PR in subsequent
commits.
@andrewor14 (Contributor, Author)

@hvanhovell I've addressed most of your comments, except for a couple that I said I would fix later. Separately, I've also significantly cleaned up the logic and signatures in the ALTER TABLE code, so you should have a look at that as well. It's probably quite different from what it looked like when you last reviewed this patch.

I'll continue the cleanup in the rest of the patch shortly. I'm marking this as WIP in the meantime.

andrewor14 changed the title from [SPARK-13139][SQL] Parse Hive DDL commands ourselves to [SPARK-13139][SQL][WIP] Parse Hive DDL commands ourselves on Mar 9, 2016
andrewor14 changed the title from [SPARK-13139][SQL][WIP] Parse Hive DDL commands ourselves to [SPARK-13139][SQL] Parse Hive DDL commands ourselves on Mar 9, 2016
@andrewor14 (Contributor, Author)

@hvanhovell I believe that, as of the latest commit, I've addressed all of your comments. This is ready from my side, so I've removed the WIP tag. PTAL.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52773 has finished for PR 11573 at commit 6ad8dd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14 (Contributor, Author)

Also cc @rxin @yhuai.

@SparkQA

SparkQA commented Mar 10, 2016

Test build #52770 has finished for PR 11573 at commit 37854fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

 */
private[sql] case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
-   sortColumnNames: Seq[String])
+   sortColumns: Seq[(String, SortDirection)]) {
Contributor:

It is great to have `SortDirection` defined explicitly: when we create it, we know what the ordering will be.

However, this change implies that we can use descending order, which is actually not allowed, because `UnsafeKVExternalSorter` only sorts keys in ascending order (this sorter is used in `DynamicPartitionWriterContainer` when we generate bucketed files). So I think it is better to revert this change for now. In the future, we can revisit it once we store the sort direction in the metastore and can actually sort rows in a bucket file in descending order.
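A minimal sketch of that constraint, assuming Catalyst's `SortDirection` types; the `require` guard is illustrative, not Spark's actual validation:

```scala
import org.apache.spark.sql.catalyst.expressions.{Ascending, SortDirection}

// Illustrative guard only: reject the descending orders that
// UnsafeKVExternalSorter cannot produce when writing bucketed files.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumns: Seq[(String, SortDirection)]) {
  require(
    sortColumns.forall { case (_, dir) => dir == Ascending },
    "Only ascending sort order is supported when writing bucketed files")
}
```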

        }
      case _ => parseFailed("Invalid CREATE FUNCTION command", node)
    }.toMap
    CreateFunction(funcName, alias, resourcesMap, temp.isDefined)(node.source)
Contributor:

What will this map look like if I have `USING JAR 'jar1', JAR 'jar2', ...`?

Contributor:

Let's also have a test for this case.
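The concern is easy to demonstrate, since Scala's `toMap` keeps only the last value per key:

```scala
// The problem in miniature: keying function resources by type collapses
// duplicate JAR entries, because toMap keeps only the last value per key.
val resources = Seq("jar" -> "jar1", "jar" -> "jar2")
val resourcesMap = resources.toMap
// resourcesMap == Map("jar" -> "jar2"); "jar1" is silently dropped
```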

@yhuai (Contributor)

yhuai commented Mar 11, 2016

I have left a few comments. It is a good starting point. Thank you for working on this!

@SparkQA

SparkQA commented Mar 11, 2016

Test build #52937 has finished for PR 11573 at commit bd91b0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor)

yhuai commented Mar 11, 2016

OK. Let's merge this first (to avoid conflicts caused by other commits). We can address the comments in a follow-up PR.

asfgit closed this in 66d9d0e on Mar 11, 2016
andrewor14 deleted the parser-plus-plus branch on March 11, 2016 at 23:31
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 14, 2016
Addressing outstanding comments in apache#11573.

Jenkins, new test case in `DDLCommandSuite`

Author: Andrew Or <[email protected]>

Closes apache#11667 from andrewor14/ddl-parser-followups.
asfgit pushed a commit that referenced this pull request Mar 17, 2016
## What changes were proposed in this pull request?

As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.

A recent patch, #11573, parses Hive commands in Spark itself but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.

## How was this patch tested?

800+ lines of tests in `SessionCatalogSuite`.

Author: Andrew Or <[email protected]>

Closes #11750 from andrewor14/temp-catalog.
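A rough sketch of the delegation pattern that commit describes, with illustrative names rather than the actual Spark API:

```scala
// Hypothetical sketch: temporary tables live in session-local state,
// while everything else is delegated to the external (metastore) catalog.
trait ExternalCatalog {
  def tableExists(db: String, table: String): Boolean
}

class SessionCatalog(external: ExternalCatalog) {
  // Temporary tables are session-local and never hit the metastore.
  private val tempTables = scala.collection.mutable.Set.empty[String]

  def createTempTable(name: String): Unit = tempTables += name

  def tableExists(db: String, table: String): Boolean =
    tempTables.contains(table) || external.tableExists(db, table)
}
```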
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Mar 17, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016